Individual Disclosure Risk Measures Based on Log-Linear Models
Not recorded
Abstract
The dissemination of microdata files must respect the confidentiality pledge under which a statistical agency collects survey data. To protect the confidentiality of respondents, statistical agencies perform a two-stage statistical disclosure control (SDC) procedure. In the first stage, the risk of disclosure of each unit is estimated with respect to a disclosure scenario. After the removal of direct identifiers, e.g. name and address, other indirect identifiers, called key variables, could still allow the disclosure of confidential information about a unit. Most of the key variables recorded in social microdata files are categorical. An important problem in SDC is estimating the number of sample uniques that are also population uniques, i.e. units at risk of disclosure. In this paper, extensions of the Poisson log-linear model for estimating a disclosure risk measure in contingency tables are presented. The main contribution is the development of smoothing strategies based on a penalized likelihood approach and on the decomposition of graphical log-linear models. Results of several tests performed on Italian 2001 census data are presented.
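As a rough illustration of the kind of risk measure the abstract refers to, the sketch below uses a formulation that is common in the SDC literature and is not necessarily the one used in this paper. It assumes population cell counts F_k ~ Poisson(lambda_k) and Bernoulli sampling with a known fraction pi, so that for a sample-unique cell P(F_k = 1 | f_k = 1) = exp(-(1 - pi) * lambda_k). The independence log-linear model, the toy table, and the function names are all assumptions made for illustration; the penalized likelihood and graphical-model smoothing described in the abstract are not implemented here.

```python
import math

def independence_fit(table):
    """Fitted sample cell means under the independence log-linear model
    (closed form: row total * column total / grand total)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_tot] for r in row_tot]

def sample_unique_risks(table, sampling_fraction):
    """Map each sample-unique cell (i, j) to an estimate of
    P(F_k = 1 | f_k = 1) = exp(-(1 - pi) * lambda_k)."""
    mu = independence_fit(table)
    risks = {}
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            if f == 1:  # only sample uniques are assessed
                # Assumed link between sample and population means.
                lam = mu[i][j] / sampling_fraction
                risks[(i, j)] = math.exp(-(1 - sampling_fraction) * lam)
    return risks

# Hypothetical 3x3 table of sample counts over two key variables.
table = [[1, 4, 10], [3, 1, 6], [8, 5, 12]]
risks = sample_unique_risks(table, sampling_fraction=0.5)

# Estimated number of sample uniques that are also population uniques.
tau1 = sum(risks.values())
```

Summing the per-cell risks gives a file-level estimate of the number of sample uniques that are population uniques. With many sparse key variables, an independence model of this kind would likely be too crude, which is one motivation for smoothing strategies such as those the abstract mentions.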
Similar resources
Bayesian Nonparametric Disclosure Risk Estimation via Mixed Effects Log-linear Models
Statistical agencies and other institutions collect data under the promise to protect the confidentiality of respondents. When releasing microdata samples, the risk that records can be identified must be assessed. To this aim, a widely adopted approach is to isolate categorical variables key to the identification and analyze multi-way contingency tables of such variables. Common disclosure risk...
A CRONYM : Data without Boundaries D
Disclosure limitation methods for protecting the confidentiality of respondents in survey microdata often use perturbative techniques which introduce measurement error into the categorical identifying variables. In addition, the data itself will often have measurement errors commonly arising from survey processes. There is a need for valid and practical ways to assess the protect...
Disclosure Risk Measurement with Entropy in Two-Dimensional Sample Based Frequency Tables
We extend a disclosure risk measure defined for population based frequency tables to sample based frequency tables. The disclosure risk measure is based on information theoretical expressions, such as entropy and conditional entropy, that reflect the properties of attribute disclosure. To estimate the disclosure risk of a sample based frequency table we need to take into account the underlying ...
Algebraic Statistics and Contingency Table Problems: Log-linear Models, Likelihood Estimation, and Disclosure Limitation
Contingency tables have provided a fertile ground for the growth of algebraic statistics. In this paper we briefly outline some features of this work and point to open research problems. We focus on the problem of maximum likelihood estimation for log-linear models and a related problem of disclosure limitation to protect the confidentiality of individual responses. Risk of disclosure has often...
Estimating Identification Disclosure Risk Using Mixed Membership Models.
Statistical agencies and other organizations that disseminate data are obligated to protect data subjects' confidentiality. For example, ill-intentioned individuals might link data subjects to records in other databases by matching on common characteristics (keys). Successful links are particularly problematic for data subjects with combinations of keys that are unique in the population. Hence,...